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Abstract 

In this paper, we propose a multivariate quantile regression method which enables local¬ 
ized analysis on conditional quantiles and global comovement analysis on conditional ranges for 
high-dimensional data. The proposed method, hereafter referred to as FActorisable Sparse Tail 
Event Curves, or FASTEC for short, exploits the potential factor structure of multivariate con¬ 
ditional quantiles through nuclear norm regularization and is particularly suitable for dealing 
with extreme quantiles. We study both theoretical properties and computational aspects of the 
estimating procedure for EASTEC. In particular, we derive nonasymptotic oracle bounds for 
the estimation error, and develope an efficient proximal gradient algorithm for the non-smooth 
optimization problem incurred in our estimating procedure. Merits of the proposed methodol¬ 
ogy are further demonstrated through applications to Conditional Autoregressive Value-at-Risk 
(CAViaR) (Engle and Manganelli; 2004), and a Chinese temperature dataset. 

Keyword: High-dimensional data analysis, multivariate quantile regression, quantile regression, 
value-at-risk, nuclear norm, multi-task learning. 
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1. Introduction 

High-dimensional multivariate quantile analysis is crucial for many applications, such as risk 
management and weather analysis. In these applications, quantile functions gy(T) of random 
variable Y such that PjV < gy(T)} = r at the ’’tail” of the distribution, namely at r close 0 or 
1, such as T = 1%,5% or T = 95%, 99%, is of great interest. This is because the quantile at level 
r can be interpreted as the lower (upper) bound with conhdence level 1 — r (r) of the possible 
outcome of a random variable, which can assist the process of decision making for treatment or risk 
management. Some practical examples: 

• Financial risk management: quantiles gy(T) of asset return with small r indicates the lower 
bound of the potential loss, which is of interest of both risk manager and market regulator. 
In particular, the quantile of asset return with r = 1% is called ”value-at-risk”. At the same 
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time, this is a high-dimensional problem as there are often several hundreds or thousands of 
asset returns to be considered. 

• Temperature analysis: quantiles at high and small r give the range of possible temperature 
variation, which is useful for crop growth or studying climate change. There may be hundreds 
of weather stations depending on the size of the region being considered. 

A global analysis in the behavior of dispersion of high-dimensional random variables can be done 
based on the observation that the difference of the quantile pair {q{T),q{l — r)) gives a flavor of 
range, which we refer as T-range. For example r = 25% gives the interquartile range, which is known 
to be a robust measure of distribution dispersion. The terminology global refers to the analysis 
of the pattern of dispersion of variables, which should be distinguished from the localized analysis 
specialized at a quantile level. While the factors for each of the two quantile allows for modeling 
asymmetry of distribution, we can detect asymmetric change of the range of the variables, such as 
expanding, shrinking, shifting, or shifting while expanding/shrinking, by the sign of loadings and 
the trend of the factors. 

Most previous data analysis method for high-dimensional data emphasizes on the variance and 
covariance structure of the high-dimensional data, and methods based on that such as principal 
component analysis can describe the linear dependence in variables when the data are symmetric, 
in similar scale and no outliers. However, knowing the linear dependence of the random variables 
does not lead to the knowledge in their lower and/or upper bounds. Moreover, for non-Gaussian 
and highly asymmetric (skewed) data, the methods based on covariance structure can be highly 
corrupted if no correction is made. 

To see that the information from the covariance and quantiles are not much related, we analayse 
data simulated from an asymmetric model. The data are simulated with 

Yij = < 0.5), 3 = 1,..., 100, 

= j = 101,...,200, 

for i = 1,...,500, where {Wj} are i.i.d. from a joint uniform [0,1] distribution with Xi G 
{Uij} are i.i.d. uniform [ 0 , 1 ] over both i and j. Ti^^j and T2^*j are j column vector of matrices 
ri,r 2 G which are of rank 2 and p = m = 200. d>(-) is the cdf of standard Gaussian 

distribution. Gonditioning on Xi, Yij is independent over j. Notice that the distribution of Yij is 
highly asymmetric and skewed, since the first 100 variables are essentially negative and the last 100 
are nonnegative. Moreover, the distribution of Yij is not continuous, since there is nonzero density 
mass ( 1 / 2 ) at 0 . 

The left figure of Figure 1.2 is the biplot of PGA on the matrix Y = (Yij), which suggests 
that Y42 and Yi are different variables, and I 42 seems to be negatively associated with Yi and is 
perpendicular to Y 142 . However, the quantile based factor analysis (our method) classifies the data 
with respect to the behavior of their quantiles at the tail (r = 1 %, 99%) of the distribution. As 
the first 100 random variables are similar in their tail behavior (bounded by 0 above), they all lie 
horizontally close to the x-axis, while the last 100 variables are lying vertically close to the y-axis. 
The reason for such phenomenon is that PGA takes a centralized view and looks at the covariance 
Cov{Yij,Yii^) for j 7^ k, and based on ( 1 . 1 ), the inner product of vectors and P^fc plays a big 
role in it. 

Our method, however, looks at the dispersion of the data Yij from an uncentralized view. From 
the factors and factor loadings in both figures of Figure 1.3, the pattern of change in quantiles at 
1% and 99% and in r-range can be determined. Furthermore, in a classification perspective, the 
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Figure 1.1: The variable simulated by (1.1). The left is Yi bounded above by 0 and the left is Yioi 
bounded below by 0. 


440 

430 241^9 412^2 


3^3 



- 1 - 1 - 

- 0.05 0.00 

Comp.1 


Figure 1.2: The PCA biplot on data Y. PCA is based on the covariance and does not capture the 
pattern in the quantiles of the distribution. 


variables close with each other on the right of Figure 1.3 have similar pattern in the change of the 
r-range. 

In this paper, we estimate the conditional quantile for high-dimensional data with covariates 
which is factorisable. This method allows for the global analysis of r-range and localized analysis of a 
specihc quantile of high-dimensional data, and is more robust to outliers and is capable of capturing 
the asymmetric distributional dispersion in the data. The key intermediate step of implementation 
is to estimate conditional quantiles for multivariate responses, which is done via the nuclear norm 
regularized multivariate quantile regression (MQR), in which the we factorise the covariates and 
then using the factors to interpret the data. To handle high-dimensional data, we assume that the 
coefficient matrix is of low rank. The detail is discussed in later sections. 

The low-rank regression has been applied to handle the problem of overparametrization and 
sparse sample size. Reduced-rank multivariate regression is of interest in a wide variety of science 
fields for cross-sectional data. The earliest work dates back to Anderson (1951) in which the 
relation between a set of macroeconomic variables and set of manipulable noneconomic variables was 
considered. Izenman (1975) formally introduced the term ’’reduced-rank regression” and anlaysed 
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Figure 1.3: The first factor of 1% and 99% quantiles of data Y(left) and the factor loadings(right). 
Variables have close distance on the right hgure have similar change in r-range, r = 1%. 


the model in detail. For more historical accounts, see Reinsel and Vein (1998) among others. The 
multivariate regression problem focuses on the expected values of the conditional distributions of 
m response variables, given p-dimensional covariates. The reduced-rank multivariate regression 
factorizes the covariates into a parsimonious group of r factors, which decompose the variation 
of the conditional expectations of the response variables and improve the interpretability of the 
cross-sectional data. 

The estimation of the conditional quantiles with low rank covariate matrix involves minimiza¬ 
tion of the empirical loss based on the ’’check function” of Koenker and Bassett (1978), with an 
additional regularization term of nuclear norm. Our model is equivalent to a multi-task quantile 
regression with low-rank structure. Fan et al. (2013) also consider multi-task quantile regression 
under transnormal model. 

Our contributions are summarized as follows: 

1. The factor model for the quantiles of cross-sectional data is proposed; 

2. A method of estimation is designed for the nuclear norm regularized non-smooth empirical 
loss function and its efficiency is 0(l/e) where e is a given accuracy level; 

3. The nonasymptotic risk bounds for the multivariate quantile regression are derived and are 
illustrated by numerical analyses; 

4. A CAViaR modification for financial risk management is demonstrated. 

5. A nonparametric curve model is considered for quantile curves and applied on temperature 
data. 

The modihcation of the Conditional Autoregressive Value-at-Risk (CAViaR) model of Engle and 
Manganelli (2004) leads to a Sparse Asymmetric Multivariate Conditional Value-at-Risk (SAMC- 
VaR) model. It can be viewed as a multiple factor version of White et al. (2015), but there is no 
need to identify the factors nor specifying the number of the factors. We apply SAMCVaR to a 
dataset consisting of banks, insurance companies and financial service firms from around the world 
between mid 2007 to mid 2010, including the period of financial crisis. Our first finding is the neg¬ 
ative leverage effect, in the sense that loss leads more to the drop of lower quantile factor than the 
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rise of upper quantile factor, which is a step further of the classical result that only suggests the loss 
leading to higher dispersion of the distribution. Moreover, we show the main risk drivers and risk 
sensitive firms in the crisis period after the beginning of year 2009. Nonparametric quantile curve 
model is an extension for the linear multivariate quantile regression model. Using the temperature 
data, we show that the quantile curve model discriminates the two extreme temperature types in 
China very well. 

1.1. Related work 

Multivariate quantile regression is studied under several different frameworks by previous au¬ 
thors, but none of them considered high-dimensional case. Serfling (2002) gives a survey of this 
research direction. Suppose the samples (Xi, Yf),..., (X„, Y„) are i.i.d. copies of (X, Y) in MP+™. 
Koenker and Portnoy (1990) suggested M-estimation in multiresponse linear regression model with 
weighting matrix. The estimator has an efficient covariance structure, but the estimator fails to 
be affine equivariant. Chaudhuri (1996), Koltchinskii (1997), and Chakraborty (2003) consider the 
geometric quantile, which is the minimizer 

arg min I V||Y-S^Xi||+u^(Y-S’^Xi)l, (1.2) 

^ 2 = 1 

where u G = {v G M"* : ||v|| < 1} controls the direction of deviation from the center of the 

data cloud and ||u|| measures the magnitude of the deviation; particularly, ||u|| = 0 corresponds 
to the median of the data cloud and ||u|| close to 1 corresponds to the tail of the distribution. 
Another line of literature tries to link quantile regression and data depth of Tukey (1975). Kong 
and Mizera (2012) estimate quantile halfspace by first projecting data on an oriented straight line 
with unit vector u, and then finding the quantile hyperplane which is perpendicular to the vector 
u and coincides with the line at the quantile of the projected data. The quantile halfspace is the 
space lying above the hyperplane. They show that their quantile halfspace correspond to Tukey’s 
halfspace depth at each chosen unit vector u. However, in practice their method cannot be used to 
construct the halfspace depth, because that would require estimating uncountably many quantile 
spaces. Hallin et al. (2010) propose a novel estimation method quantile halfspaces, and show that 
the upper envelop of the resulting upper quantile halfspaces coincides with Tukey’s halfspace depth 
and is computable. Asymptotic properties including a Bahadur representation are also established 
in this paper. 

High-dimensional multivariate regression (MR) has been extensively studied in recent years, 
though the non high-dimensional MR has been around for decades. We review some key ingredients 
of this model. Suppose 


Yi — , ( 1 - 3 ) 

where the entries of are independent with mean 0. In order to recover the matrix T, assuming 
that £i ~ X(0, Yj,), one minimizes the loss (or negative log likelihood) tr [(Y — XS)n(Y — XS)"''] 
with respect to matrix S, where 12 is a weighting matrix. Common choices are fl = and 
Im, while the former choice generates the efficient estimator and the later choice only guarantees 
consistency. An issue of this approach is that it neglects the dependency in the response variables 
in covariates X (heteroskedasticity). Another issue is overparametrization, since p and m can be 
large relative to n and one cannot hope to consistently estimate the model. To deal with these two 
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issues, Izenman (1975) proposed the reduced rank approach. For a predetermined integer r > 0, 
arg min tr [(Y — XS)ri(Y — XS)"''] s.t. rank(S) < r. 

SgRpXm 

The number of variables unknown is thus reduced to r <C max{p, m}. Reinsel and Vein (1998) gave 
an explicit review of this approach. 

In the traditional approach described above, r has to be determined ex-ante. In more recent 
developments, Yuan et al. (2007) proposed a penalization approach, in which they estimate the F 
matrix by minimizing: 


||Y-xr||F + A||r||„ (1.4) 

where A > 0 is a constant. They pointed out the connection between the reduced rank model 
and factor analysis and proved that an estimator F can be obtained by soft-thresholding the OLS 
estimator. Bunea et al. (2011) estimate F by minimizing ||Y — XF||f + Arank(F), and they show 
nonasymptotic risk bonnds for both their estimator and the estimator from minimizing (1.4). They 
also show that both estimators recover the rank of F with high probability. In high-dimensional 
setting, Negahban and Wainwright (2011) consider two cases that F is either exact low rank or 
near low rank. For both cases, they obtain nonasymptotic risk bounds for estimating the trne F 
with nuclear norm penalized estimator F. Negahban et al. (2012) present a unified framework for 
analyzing high-dimensional M-estimator with differentiable convex loss functions and decomposable 
penalizing term. Although the nuclear norm is decomposable, the asymmetric absolnte loss fnnction 
for estimating conditional quantiles is not differentiable and cannot be minorized with a quadratic 
function, so that the framework of Negahban et al. (2012) cannot be directly applied to our problem. 

For high-dimensional multi-task quantile regression, Fan et al. (2013) consider the problem 
under a transnormal model. They estimate transformations of independent variables which si¬ 
multaneously explain the qnantile of each response variable and make the joint distribution of 
transformed covariates and response Gaussian. Comparing to their work, onr method assnmes 
low-rank structure, but we do not impose any distribntion assumption. 

1.2. Organization of the paper 

The remaining part of this paper is organized as follows. In Section 2 we discuss the factorisation, 
and its similarity to the estimation of factors in traditional factor models. Section 3 is devoted 
to the algorithm for solving the optimization problem and analyzing the convergence property 
of the algorithm. The tuning procedure is also explained in this section. In Section 4 the oracle 
properties of our estimator are investigated. A Monte Carlo simulation study is presented in Section 
5. Section 6 is devoted to applying onr technique to the estimation of SAMCVaR. Empirical results 
are presented. Section 7 discnss a nonparametric estimation of multivariate quantile curves, which 
again can be factorised into factor curves. A real data application on Chinese temperature data is 
also presented. Detailed proofs are shifted to the appendix. 

1.3. Notations 

def 

The following notations are adopted throughout this paper. Given two scalars x and y, x Ay = 
min{x, y} and xVy max{x, y}. l(x < 0) is an index fnnction, which is equal to 1 when x < 0 and 
0 when x > 0. For a vector v = (xi, ...,Vp) G let ||v ||2 = (X]j=i and ||v||oo = maxj<p |xjj 

be the vector £2 and infinity norm. For a matrix A = (Aij) G given the singular values of 
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A: o-i(A) > cr 2 (A) > ... > crpAm(A), let ||A|| = o-i(A), ||A||* = (Tj(A) 

and II A||f = \J = tr(AA^)^/2 = tr(A^A)^/2 = {Ylj=i YJk=i and be the 

matrix spectral norm, nuclear norm (or trace norm), Frobenius norm. The jth column vector of A 
is denoted by A*j. Similarly, the ith row vector of A is denoted by Aj*. The minimal and maximal 
singular values of A is denoted by crmin(A) and (Tmax(A). Ip denotes the pxp identity matrix, and 
1 denotes the matrix with all entries equal to 1 . (•, •) : x M denotes the trace inner 

product given by (A, B) = tr(AB^). For a function / : —>• M, and Zi G MP, define the empirical 
process Gn{f(Zi)) = Er=i{/(-^*) “ E[/(-^i)]}. 

Definition 1.1 (Sub-Gaussian variable and sub-Gaussian norm). A random variable X is called 
sub-Gaussian if there exists some positive constant K 2 such that P(|A| > t) < exp(l — t^/A|) for 
all t > 0. The sub-Gaussian norm of X is defined as ||X ||.02 = supp>]^p |X|^)^/^. 

2. Factorizable sparse multivariate quantile regression 

To motivate the estimation of factors in the quantile of a random variable, we first shortly 
review the classical linear factor model. Linear factor models, such as Capital Asset Pricing Model 
(CAPM) and Arbitrage Pricing Theory (APT), are popular in economics and finance for describing 
the relationship between asset returns and factors. The standard setting is 

Xij — tpjlFii T tpj2Fi2 T T '4^jrFir T (^'1) 

where Yi G M"* is a vector of asset returns, F)i,...,F)r are factors and Sij is the portion not related to 
the factors. Assumptions are Cov{Fik, Sij) = 0 for all A; = 1 ,..., r and j = 1 , ...,m, Cov{eij, su) = 0 
for all j ^ 1. Factors Fik can be viewed as hedging portfolios or macroeconomic drivers depending 
on the context. Note that the number of factor is exactly one in terms of CAPM. 

The linear factor model (2.1) can be estimated even when the factors are not identified ex-ante. 
The multivariate regression model can estimate the factors and loadings, if it is known that some 
exogenous macroeconomic variables Aj G are relevant to Taking conditional expectation 
to factor model ( 2 . 1 ) gives 


E [T,,- \Xi] = Y,^3k^[Fik\Xi], (2.2) 

k=l 

Suppose that E[F)_fc|Ai] = (pJXi, where = {pki, ■■■, ^kp)- We have the multivariate regression 
model 


E[Yi\X,]=TFXi, (2.3) 

where = (EEi Efc=i P can be estimated with a multivariate regres¬ 

sion model (1.3) with the rank of F being r. The benefit of considering such model is that this 
incorporates the cross-sectional information in Y^. This is closely related to multi-task learning 
paradigm in machine learning literature. Gibbons and Person (1985) was the first to present the 
model (2.3). One can also see Chapter 8 of Reinsel and Vein (1998) for detail. One remark is 
that for the traditional multivariate regression technique introduced in Reinsel and Vein (1998), 
the number of factor r is assumed to be known or has to be obtained via other method. However, 
using the advanced regularization method of Yuan et al. (2007), Bunea et al. (2011) or Negahban 
and Wainwright (2011), knowing r is not necessary for estimation. 
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One remark is that knowing F does not trivially yield the estimate for factors and factor loadings, 
because the decomposition of F = is not unique, in which $ corresponds to the factors and 
'J' corresponds to the factor loadings. The ideal decomposition requires $ to be a matrix with r 
nonzero columns, so that we have r factors, and T' is a unitary matrix. As pointed out in Section 
2 of Yuan et al. (2007), this can be done via singular value decomposition. Suppose the singular 
value decomposition of F is F = UDV^, where U and V are unitary matrices and D is rectangular 
diagonal matrix with kth diagonal element being the singular value and = 0 for k > r. The 

factor loadings ■j/’j = j* satisfies ||'i/’j ||2 = 1 for 1 < j < m. Letting $ = $ has only r 

nonzero rows. The factor is formed as Fik = 

Conditional quantile is of our focus. We estimate the quantile of response variables Yj, j = 
parametrically as (2.3). Let qj{T\Xi) be the conditional quantile of Yj conditional on 
Xi G for j = 1,..., m and f = 1,..., n, 

qj{T\Xi)=XjT,^{T), (2.4) 

where F*j is jth column of matrix F G which is assumed of low rank r <C min{p, m}. The 

model is posed in a high-dimensional setting: p, m —>■ oo while the sample size n —)> oo. 

Furthermore, model (2.4) is factorisable. Suppose the SVD of F is F = UDV"'^ and the number 
of nonzero singular values is r, similarly to (2.2), 

r 

q,{T\X,) = Y,V,,kfkm, (2.5) 

k=l 

where fj^{Xi) = cjfcUj)i,Xj. With slight abuse of terminology, we also call fJ^{Xi) ’’factors” with Vj^k 
being ’’factor loadings”. For mean regression (2.3), factorisation would give a factor model (2.1). 
In the practice of multi-task or multivariate quantile regression, factors are handy for classification 
and prediction. We will explore its power with real data in Section 6. 

To find an estimator F for F, quantile regression proposed by Koenker and Bassett (1978) allows 
to recover the conditional quantile of a univariate response. Our loss function 

fA(r)=^arg min i (mn)“^ ^ ^ (Yj - -k A||S||* i, (2.6) 

i=l j=l 

where Pt{u) = 'u('r — ^{u < 0}) and is jth column of matrix S. The first term controls the 
quality of fitting, which is similar to the loss function proposed in Koenker and Portnoy (1990). 
The second term nuclear norm regularization is applied to encourage the accurate estimation, as 
the rank of the matrix F is degenerate and is sparse. The quantity r is considered fixed in our 
discussion. 

Note that Pt{u) is not globally differentiable, where 0 < r < 1 is a given quantile level. The 
idea of solving (2.6) is first smoothing the loss function by the method of Nesterov (2005), and 
then applying the fast iterative proximal gradient algorithm of Beck and Teboulle (2009). It will 
be shown in Theorem 3.2 that our method achieves the efficiency of 0(l/e), where e is a given rate 
of accuracy, say I0“®. Nonasymptotic oracle properties of F are established in Section 4. 

3. Computation 

In this section, we discuss how the estimate defined by (2.6) can be computed efficiently. The 
procedure can be summarized in Algorithm I. The main result on efficiency of the algorithm is 


Theorem 3.2. 

The problem of solving a nonlinear program like (1.4) and (2.6) has received a lot of attention 
recently. One strand of literature using the proximal gradient approach, exploits the fact that 
the proximity operator of nuclear norm has a closed form, which performs soft-thresholding of the 
singular values of the input matrix. Such algorithm requires singular value decomposition (SVD) 
in each iteration, and this may be computationally expensive when the matrix is large. Ji and Ye 
(2009) and Toh and Yun (2010) propose algorithms in this line which obtain e-accurate solution 
in 0(l/yY) steps. A second strand of literature reformulates the optimization problem into a 
semidefinite program and then applies available solvers. Though traditional solvers such as SDPT3 
or SeDuMi are not suitable for high-dimensional data, Jaggi and Sulovsky (2010) constructed an 
algorithm based on the algorithm of Hazan (2008) and applied it on large datasets. This approach 
avoids performing SVD in each step, but in general it requires 0(l/e) steps to reach a e-accurate 
solution. 

Our algorithm follows the first line of proximal gradient algorithm. As in Jaggi and Sulovsky 
(2010) it is required that the loss function to be differentiable. In our simulation study we show 
that our algorithm is able to handle matrices with hundreds of rows and columns. 

A key difference between our problem to those studied in the articles mentioned above is that, 
beside the nuclear norm penalty term, the first term in our loss function in (2.6) is non-smooth, and 
this suggests that the direct application of proximal gradient algorithm may not generate desirable 
result. Therefore, there are two important questions one needs to answer: how to transform the 
problem so that it produces favorable properties and what is the price for such transformation? 
In what follows we will answer both questions by showing a procedure to smooth the non-smooth 
loss function and obtain the convergence rate of our algorithm. Our approach is inspired by 
Chen et al. (2012), who deal with sparse regression problem with non-smooth structured sparsity- 
inducing penalties. They apply the method of Nesterov (2005), who suggests a systematic way to 
approximate the non-smooth objective function by a function with Lipschitz continuous gradient. 
Our smoothing method is based on this idea as well. 

Recall that our goal is to minimize the following loss function: 

n m 

L{T) = {mn)~^ r,,-) +A||r||/=i^Q,(r) + A||r||„ (3.1) 

i=l j=l 

where Pt{u) = u(r — l{u < 0}) with given 0 < r < 1. 

QT-(r) is clearly non-smooth. To handle this problem, we introduce the dual variables Qij to 
rewrite as 

n m 

Qr(r) = niax (mn)-^ ^ ^ 0^ - Xj . (3.2) 

To see that this equation holds, note that for each pair of i, j, when Yij — Xj T^j > 0 , Qij = T since 
r is the largest ’’positive” value in the interval [r — 1, r]; when Yij — Xj T^j < 0, Qij = r — 1 since 
r is the smallest ’’negative” value in the interval [r — 1, r]. This verifies the equation. Observe that 
it is necessary to choose [r — l,r] rather than {r — l,r} for the support of Qij in order to satisfy 
the convex set conditions given in Nesterov (2005). Though both choices fulfill the equation, the 
previous one is an interval and therefore a convex set while the later one is not convex. This choice 
is the key to the smoothing approximation discussed later and will influence the gradient of the 
smoothed loss function. 
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The formulation of QT(r) given in (3.2) is still a non-smooth function of F, and this makes the 
subgradient based algorithm inefficient. To smooth this function, denote 0 = (0ij) the matrix of 
Oij, we consider the smooth approximation to QT-(r): 


Q. 


(F) = 


max 


0ije[T-i,T] 


mnl 


\-i 


£(F,0) 



(3.3) 


where £(r, 0) = ^"=1 ET=i ©T r*,) , and K > 0 is a smoothing regularization constant 

depending on m, n and the desired accuracy. When k —)• 0, the approximation is getting closer to 
the function before smoothing. We anlayse the convergence rate of our algorithm based on Theorem 
1 of Nesterov (2005). 

LEMMA 3.1. ^(F, 0) can be expressed as £(F, 0) = (-XF, 0) + (Y, 0). 

Since the function f ||0||p is strongly convex, the optimal solution 0*(F) for achieving (3.3) is 
unique for each F. We introduce a notation: for any matrix A = (A^), [[A]],- = ([[Ajj]],-) where 


[[Aij]]r 


'■ Aij, 


if Aij > r; 

if r — 1 < Aij < r; 

if Aij < r — 1. 


This function performs componentwise projection on a real matrix to the interval [r — l,r]. The 
next theorem presents properties of the (smooth) function (5r,K(F). 

THEOREM 3.1. For any k > 0, (5t,k(F) is well-defined, convex and continuously-differentiable 
function in F with the gradient VQt-,k(F) = —(mn)“^X^0*(F) G where 0*(F) is the 

optimal solution to (3.3), namely 


0*(F) = [[(Atmn)-i(Y-XF)]],. 


(3.4) 


The gradient V(5r,K(F) is Lipschitz continuous with the Lipschitz constant M = ^||X|p. 

By inserting (3.4) into the equation of VQt-,k(F), we arrive at the gradient which will be applied 
in our algorithm: 


VQ,,^(F) = -(mn)-iX^[[(/^mn)-i(Y - XF)]]^. (3.5) 

Observe that (3.5) is similar to the subgradient —X{r — 1(Y — XF < 0)} of (5r(F), where the 
operator r — !(• < 0) applies componentwise to the matrix Y — XF with a slight abuse of notation. 
The major difference lies in the fact that (3.5) replaces the discrete non-Lipschitz r — 1(Y— XF < 0) 
with a Lipschitz function [[k“^(Y — XF)]]^. Figure 3.1 illustrates this approximation property in 
a univariate framework with m = n = 1 and X = 1. Denote ipriu) = r — l(u < 0) the subgradient 
of Pt{u). The solid line pictures the function ipriu) with r = 0.5, which has a jump at the origin. 
The dashed line corresponds to the smoothing approximation gradient [[ft:“^(Y — XF)]],- associated 
with K = 0.5, which connects the discontinuous part and joins the function V't('w) when it reaches 
r the right end and r — 1 at the left end. As k decreases to 0.05, we observe that the smoothing 
approximation function is getting steeper around the origin and closer to pr- 

Let S\{-) be the proximity operator given in Theorem C.l. We state the main result of this 
section in Algorithm 1 for the optimization problem (2.6). The name of the algorithm reflects the 
fact that it is a combination of the smoothing procedure and the fast iterative shrinkage-thresholding 
algorithm (FISTA) of Beck and Teboulle (2009). 
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Figure 3.1; The solid line is the function 'ijjriu) = t — \{u < 0) with r = 0.5, which has a jump at 
the origin. The dashed line corresponding to the smoothing gradient [[k“^(Y — XT)]],- associated 
with K = 0.5. As K decreases to 0.05, we observe that the smoothing approximation function is 
closer to 'ipr{u). 


Algorithm 1: Smoothing fast iterative shrinkage-thresholding algorithm (SFISTA) 


IX 


| 2 . 


1 Input:^Y, X, A, k= M = ^^^1 

2 Initialization: Fq = 0, Hi = 0, step size (5i = 1; 

3 for t = 1, 2,..., T do 


^t+1 — - 2 -> 

7 end 

8 Output r = Fr 


The efficiency of Algorithm 1 is given by the following theorem. 

THEOREM 3.2 (Convergence analysis of Algorithm 1). Let {Ftj^g be the sequence generated 
by Algorithm 1, and F* be the optimal solution for minimizing (3.1). Then for any t and e > 0, 


|L(R)-L(F*)|< 


e(r V {1 
2 


r})" ^ 4||Fo-F*|||||Xf 

{t + lyemn 


(3.6) 


If we require L{Tt) — L{T*) < e, then 


t > 2- 


|F* - F 


OllF 


|X|| 


(rV{l-T})2 

2 


T - 


(3.7) 


REMARK 3.1. 1. The first term on the right hand side of (3.6) is related to the smoothing 


11 



















error, which cannot be made small by increasing the number of iteration, but can only be 
reduced by choosing a smaller smoothing parameter k. This is the price we pay for the 
smooth approximation. The second term is related to the fast iterative shrinkage-thresholding 
algorithm of Beck and Teboulle (2009). 


2. The original FISTA algorithm without smoothing yield the convergence rate 0{1/y/e). In our 
case, smoothing approximation error deteriorates the convergence rate and the best we can do 
is 0(l/e), which is comparable to the rate obtained by Nesterov (2005). As an improvement, 
our rate is still better than 0(l/e^) given by the general subgradient method. 


3. The quantile level r enters the numerical bound (3.6) by a factor ^1 
increases when r is getting close to the boundary of (0,1). 


2 


- 1/2 

, which 


For implementation, it is crucial to appropriately select A. In theory, one can select A based 
on (4.13) which gives the oracle result in Section 4, but the value does not adapt to the data very 
well. We propose a way to select A based on the ’’pivotal principle”, which are better adaptive to 
the data. 

Define the random variable 


A = (nm)“^||X^W||, (3.8) 

where Wij = l{Uij < 0) — r, {Uij} for i = 1, ...,n and j = 1, ...,m are i.i.d. uniform (0,1) random 
variables, independently distributed from the input variables Xi,..., The random variable A 
is pivotal conditioning on design X, as it does not depend on unknown parameter T. Notice that 
(nm)“^X''~W is the score V(5r(r). Set 


A = 2-A(l-a|X), (3.9) 

where A(1 — a|X) (1 — a)-quantile of A conditional on X, and c is an absolute constant. This is 
consistent with the pivotal principle applied in the high-dimensional quantile regression of Belloni 
and Chernozhukov (2011) and square-root Lasso Belloni et al. (2011). The choice of the statistics 
(3.8) is motivated by VQ(r), which plays a crucial role in oracle inequalities in Section 4. 

4. Oracle inequalities 

In this section we present the non-asymptotic oracle bounds of the estimator T defined in (2.6). 
The main results are Theorem 4.1 and Corollary 4.1, which are established through the convexity 
and geometric argument of Belloni and Chernozhukov (2011), concentration inequalities, and T-net 
arguments. 

Our risk bounds resemble the corresponding results of multivariate regression for mean, such 
as those in Negahban and Wainwright (2011) and Koltchinskii et al. (2011). We will compare our 
results to theirs in Remark 4.1. Koltchinskii (2013) presents an oracle inequality for excess risk on 
nuclear norm penalized convex empirical risk minimization. We cannot apply their result because 
our quantile loss function is not differentiable. In a novel paper, Belloni and Chernozhukov (2011) 
develop theory for high-dimensional Lasso estimator of non-multivariate regression for quantiles. 
The idea to prove their main theorem is very general and can be adapted to our case of multivariate 
regression for quantiles. However, some technical properties still need to be established before their 
method can be applied. 
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Let {Xi,Yi), ...,{Xn,Yn) be i.i.d. copies of random vectors in Recall Pt{u) = 

u{t — l{u < 0}) and its subgradient ipriu) = t — l{u < 0), and that F is defined as (2.6). Recall 
also the empirical loss 

n m 

Qr{S) = (nm)-i J2Y.P- 

i=i j=i 

and its expectation Qt{S). We define F be the minimizer of (5r(S), and the difference A = F — F. 
The subgradient for the empirical loss function <3r(F) is the matrix 

n 

VQr{T) = {nm)-^^XiW^ = (nm)-^X^W E , 

i=l 

where X is the design matrix and 

Wi (l{Y^j - Xj F*j < 0) - r) , W = [W^i,..., Wn]^ E 

In what follows we generalize the support of vector to matrix by projections. If A E is 

of rank r, and the singular value decomposition of A is A = o'(A)ujvJ with orthogonal 

vectors ui,...,Ur E and vi,...,Vr E the support of A is defined by { 81 , 82 ) in which 
= spanjui, ...,Ur} and 82 = spanjvi,..., v^}. We define the projection matrix on 5i by Pi = 
U^(UjU^)-^Uj =UrUj in which is a p x r matrix whose columns are formed by {ui,..., u^}, 
and U^TUr = R because {ui, ...,Ur} is an orthonormal basis. Similarly, P 2 = V^Vj. On the other 
hand, define the orthogonal projection of Pi and P 2 by Pj*- and P^. For any matrix S E 
we define the projections: 


Pa(S) =^S-P^SP^; Pi;(S) P^SP^. 


Define the cone 


/C(F;co) = {seM^^™: ||Pi(S)||* <co||Pa(S)||*}. (4.1) 

Assumption 4.1 (Sampling setting). Samples {Xi,Yi), ...,{Xn,Yn) are i.i.d. copies of (X,!^) 
random vectors in {t\x) = £c''^F*j(r). Conditioning on Xi, Yij is independent in j. 

Assumption 4.1 postulates that the data are i.i.d and there is no cross-sectional dependence 
in Yii, ...,yim conditioning on Xj. This suggests that all dependency in the components of R is 
captured by the covariates Xj. This assumption is stronger than that usually required for factor 
models, for which uncorrelatedness is often sufficient. 

Assumption 4.2 (Covariance matrix condition). Let the covariance matrix of X be Xx, as¬ 
sume that 0 < (Tmin(Xx) < <7max(Xx) < oo. Moreover, assume the sample covariance matrix of 
covariates Sv = -X^X satisfies 

^ n 

P[o’min(Xx) ^ Ci(Tniin(Sx)) Crmax(Xx) ^ C 2 Crmax(Xx)] ^ 1 Tn- {^■‘^) 

When the covariates come from a joint p-Gaussian distribution X(0,Xx), Lemma C.3 shows 
that (4.2) holds with ci = 1/9, C 2 = 9 and 7 ^ = 4exp(—n/2). 

Assumption 4.3 (Conditional density condition). There exist / > 0 and /' < 00 such that 
\^fYi,\xM\^)\ ^ f and infj<minfai/y,,|x,(a;’^F*i|a;) > /, where /y-^iXi is the conditional den- 
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sity function of Yij on Xi. 

Similar condition as Assumption 4.3 can be found in Belloni and Chernozhukov (2011). The 
quantity / controls the curvature of the population loss function, which can influence the estimation 
error. Negahban et al. (2012) give an extensive account on this issue. 

Assumption 4.4 (Restricted eigenvalue and nonlinearity). For a given probability distribution 11 
for X, 


/3r,3 = inf {/I > 0 : /3||iPr(A)||F < ||A|U,(n), VA E /C(r,3)} > 0, 

- > 0 , 


def 3 / . 

v = mf 


IAI|3 


8/' AG/C(r,3) m-1 E[|X7A 

•' a4o LI i 


*j\ 


(4.3) 

(4.4) 


where ||S|li 2 (n) ^ ^ ||S'''Xj|||. 

The cone /C(r,3) appears often in Lasso literature, for example in Bickel et al. (2009) and 
Negahban and Wainwright (2011) among others. Similar assumption on the existence of constant 
/ 3 r ,3 can also be found in Negahban and Wainwright (2011) and Koltchinskii et al. (2011). From 
Assumption 4.2 and the fact that ||'Pr(A)||F < ||A||f, we have a rough lower bound / 3 r ,3 > 

The restricted nonlinearity constant v is proposed by Belloni and Chernozhukov (2011), which 
is used to control the quality of minorization given in Lemma 4.2 (i). Section 2.5 of Belloni and 
Chernozhukov (2011) calculate v for various data generating processes under different design. 

The following lemma asserts that the empirical error F — F lies in the cone JC{T, 3). The proof 
can be found in Section B.l 

LEMMA 4.1. Suppose A > 2||Vg(r)|| and A = f - F. Then ||R7(A)||* < 3||Rr(A)||*. That is, 
A e/C(F,3). 

The next lemma characterizes useful properties which will be used later. The proof can be 
found in Section B.2. 

LEMMA 4.2. Under Assumptions 4.3 and 4.4, we have 

(i) If II A||i,(n) < 41. and A E IC{T, 3), Qr{T + A) - Qr{T) > |/|| A||F,(n); 

(ii) If A E /C(F,3), ||A|7 < ^^|| A||F 2 (n), where r = rank(F). 

The following technical lemma characterizes the convergence rate on the empirical error of the 
loss function. In the proof we repeatedly apply the Hoeffding’s inequalities and Assumption 4.2. 
The proof can be found in Section B.3 

LEMMA 4.3. Under Assumptions 4.1-4.4. Let 


A(t) = sup 

Gn 

r m 1 

m-^ ^ {pAY^j - Xj (F,, + A,,)} - pAAj - Xj F,,}) 

1 AllL2(n)<*AsA;(r,3) 


j=i J 


Then 

^ ^^/ 2{r V (1 -t)} aVC2Umax(^x) Iog(p + ^ 

with probability greater than 1 — 9{p + m)~‘^ — 7 ^, where C 2 ,C' are universal constants from 
Assumption 4.2 and Lemma C.l, a = with r = rank(F), / 3 r ,3 from Assumption 4.4, and 

p -|- m > 3. 
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The following theorem derives the bounds for the prediction error, Frobenius and nuclear norm, 
expressed in terms of A, condition number Slxj t and /. The proof follows similar steps as proving 
Theorem 2 in Belloni and Chernozhukov (2011), which explicitly exploits the convexity of the loss 
function and cone condition. 


THEOREM 4.1. Under Assumptions 4.1-4. 4, A> 2||VQ(r)|| and the growth condition on r: 

(4.6) 


Then 


^ a/ crmax(Sx) log(p + m) ^ A^ 4^/^ ^ 

.■■'mTS/. '"^ 1 ) 1 ^ ' 


If - riu.,„, < ^ « 

^ ’ m^/nf f 


|r-r||F <4a- 


/ <7max(5jx ) 

/log(p -1- m) 

/ i7min(^y) V 

n ■ 


- + 4A 


ma 


O'mmi^x) f 


|r-r||* <4a- 


a 


Y/crmax(Sx) /log(p m) , 


mf 


n 




(4.7) 

(4.8) 

(4.9) 


with probability 1 — 9{p + m) ^ — 7n, where a = with r = rank(r), /lr ,3 from Assumption 4.4, 
Cr = \/c 2 , C' > 0 is a universal constant from Lemma C.l, C 2 from Assumption 


4.2 and p -|- m > 3. 

Proof of Theorem 4-1- Let 

Hi = the event that Assumption 4.2 holds; 

H 2 = the event A{t) < ^^ »V-.nax(Sx)iog _(p+^_ 

Note that the probability of event P(Hi n H 2 ) > 1 — 7n — 9(7* + m)“^. Set 

t = log(p + m) ^ ^ 

my/nf 

We show that on Hi n H 2 , ||W'''A|| > t is infeasible. Let A = T — E. On event {||AC~''A|| > t}, 
from Lemma 4.1, one has 


0> inf Q,(r +A) -Q,(r)+ A(||r + All* - ||r||*), 

l|A|T2(n)>hAe>c(r,3) 


(4.10) 


As argued in the proof of Theorem 2 of Belloni and Chernozhukov (2011), from the facts that 
L Qt(-) + A|| • II* is convex; 

2. /C(r, 3 ) is a cone, 

(4.10) forces the value of Qt{T + A) -|- A||r -|- A||* on {A : ||A|| 2 , 2 (n) ^ t,A ^ /C(r,3)} to be 
less than that evaluated at A = 0. Convexity implies that Qt{^ + ^) + A||r -|- A||* evaluated at 
{A : ||A||ji 2 (n) = t,A. G /C(r,3)} must be smaller than that evaluated at A = 0. Thus, we have 
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the inequality 


0> inf Q,(r +A)-Q,(r)+ A(||r +All, - ||r||*), 

|iA|U2(n)=hAeK{r,3) 

It can be further deducted that 


o> inf g,(r + A)-Q,(r)-n-i/ 2 ^(t) + A(||r + A||,-||r||,), 

||A|U2(n)=i,AeK(r,3) 

By triangle inequality, |||r + A||, — ||r||,| < ||A||, < a||A|| 2 , 2 (n) = on the set {HAH^^in) = 
t, A G /C(r, 3)}. Furthermore, by Lemma 4.3, on event H Q 2 

A(t) < + 

m 

Therefore, on event fli n II 2 , it holds from Lemma 4.2 (ii) that 


o> inf g^(r + A) 

||A|T2(n)=t,Ae/C{r,3) 


^^(r) 


- Aat 

my/n 


Finally, applying Lemma 4.2 (i) as u > t/4 = || A|| 7 ^ 2 (n)/ 4 , we have 


0> int 

||A||i2(n)=hAe/C(r,3) 4- m^/n 


(4.11) 


With our choice of t, (4.11) cannot hold. Thus, the inequality (4.7) holds. 

The inequality (4.8) can be obtained by the simple observation that ||A||^^^j^j > (o' TT'in (Sy)/m)|| A 
The inequality (4.9) for ||A||, follows from the fact that A G /C(r,3) by Lemma 4.1, Lemma 
4.2 (ii) and the bound for ||A||^ 2 (n)- D 

Next lemma gives the bound for ^||X^W||. From which we obtain a bound for ||Vg(r)||. 


LEMMA 4.4. Under Assumption 4.1 and 4.2, 


hlX^WlI < C'Vf^ma.(Xx){r V (1 - r)}./^^, where C* = (4.12) 

n y n V ^ 

with probability greater than 1 — where C' and C 2 are absolute constants given 

by Hoeffding’s inequality C.l and Assumption 4.2. 

Let us take the rough bound /Ip ,3 > cT min fSyT- Lemma 4.4 and Lemma 4.1 suggest to 

take 

A = 2—^ara^{^x){TVil - (4.13) 

m \ n 

By the choice (4.13), Theorem 4.1 yields the oracle rate, which we summarize in Corollary (4.1). 

The last result in this section gives the rate of convergence under the choice of A given in (4.13), 
which will be the guideline for simulation comparison in Section 5. 

COROLLARY 4.1. Assume that Assumptions 4.1-4.4 hold and select A as (4.13). Under the 
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growth condition on r: 


C;_ / (Tmax(Sx) 
mV 


Then 


IT - r 




c; /a„„(sx)y ^ ^ _ ^, ^M iog(p + m) ; ^ ^ 


IT -TIIf < ^ 


/ V <in(^^ 

c; 


r-r L < —,, , 

" "* - / V <in(Sx) 

with probability greater than 1 — 7 „ — 9(p + m)“^ — and p + m > 3, where 

2 


C' = 8 V 2 




logs 


C" V (1 - r) 

C" = 4\/2C; with r = rank(r), /3r,3 from Assumption 4.4 and C 2 from Assumption 4.2. 
Proof of Corollary 4-1. Let events fli and VI 2 be defined as in the proof of Theorem 4.1, and 


(4.18) 


Qs = the event that —||X^W|| < C'*-\/||Sx||{t V (1 — —. 

n \ n 

Note that the probability P(ni n 112 H Us) > 1 — 7n — 9(p + m)“^ — On Hi n II2 O II3, 

the bounds (4.7), (4.8), (4.9), and (4.12) hold. Inserting the rate of A in (4.13) and the lower bound 
Pr ,3 > m~^/^-^/cr miTi (SyT into (4.7), (4.8), and (4.9) yields bounds (4.15), (4.16), and (4.17). □ 

REMARK 4.1. 1. The restricted nonlinearity constant ly enters the bounds only through the 

growth condition (4.14) on r. This corresponds to the Lasso for quantile regression of Belloni 
and Chernozhukov (2011). 

2. Component of the risk bounds: Corollary 4.1 shows that the errors are close to the estimation 
error given the true model. The bounds (4.15), (4.16), and (4.17) consist of three components: 
the dimensionality, covariance matrix of the covariates and conditional density of Y given X. 
When p and m are fixed with respect to n, the errors decrease in . p and m are allowed 
to grow with n; however, they are not allowed to grow faster than n. This phenomenon is 
also found in the multivariate regression for mean, see Negahban and Wainwright (2011), 
Koltchinskii et al. (2011) among others. Rank r of matrix T enters the bound as a factor, 
and r(p + m) is the number of unknown parameters. The covariates can influence the bounds 
(4.15), (4.16), and (4.17) through the condition number of the covariance matrix 

Sx- Large condition number also introduces instability to multivariate regression for quan¬ 
tiles as for mean. Finally, the minimal value of densities / and the quantile level r are related 
to the conditional distribution of Yij give Xi and are only seen in multivariate regression 
for quantiles. We show in (4.15), (4.16), and (4.17) that small minimal value of densities /, 
which may result from the large support of Yij, can result in inaccurate estimation. On the 
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other hand, the estimation at r close to 0 or 1 is also difficult as r V (1 — r) enters as a factor 
to the estimation errors. 


5. Simulation 

In this section we check the performance of the proposed method via Monte Carlo simulations 
and verify the oracle properties in Section 4. In the first set of simulation, we consider three 
symmetric models, which are different in terms of the degree of sparsity. In the second set of 
simulation, an asymmetric setting is considered with two different degree of sparsity. We consider 
three symmetric models with different degrees of sparsity in Section 5.1. Section 5.2 is devoted to 
two asymmetric models. 

5.1. Symmetric models 

We consider three models that differ in complexity: 

• Model LS (Less sparse): Set m = p = n = 500. In each iteration, each entry of the p x m 
coefficient matrix F is generated from a i.i.d. normal distribution. Setting the last 375 
singular values of F to 0; 

• Model MS (Moderate sparse): Generating F as Model LS. Setting the first 10 singular values 
to 30, and 0 for the rest; 

• Model ES (Extremely sparse): Generating F as Model LS. Replacing the first singular values 
by 20, and 0 for the rest. 

Given the F generated by the model above, at each iteration, we generate Xi from A^(0, S) with 
aij = 0.5l*“-^L The response variable is generated as 

Y^ = r'^Xi + ei, (5.1) 

where Si is a random vector in which each element is from i.i.d. standard normal distribution. 

We estimate the model at quantile levels r = 0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95. In order 
to get some ideas on the solution path, we set A = (5 x 10“®,10“^,5 x 10“®, 10“^) for com¬ 
parison purpose. For reference, using the tuning technique in Section 3, the simulated A = 
(0.00477,0.00465,0.00438,0.00346) for r = 5%, 10%, 20%, 50%. The A for r = 95%, 90% and 80% 
are the same as that of r = 5%, 10%, 20% by symmetry. The iteration run is 500. 

We stop the SFISTA algorithm at step t when the difference of loss function at step t — 1 
and t is less than 10“®. Moreover, considering the size of our model and the choice of k in the 
simulation study of Chen et al. (2012), we directly set k = 0.0001, rather than applying the k given 
by Theorem 3.2. 

The performance of F is measured by: 

• Prediction error: m“^||X(F — F)||f; 

• Model selection: Frobenius error ||F — F||f and nuclear error ||F — F||*; 

• Estimated number of nonzero singular values; 

• Computational time. 
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The number of nonzero singular values is determined by the sudden drop in singular values of T. If 
the drop from fth singular value to (f + l)th singular value is greater than a given threshold, then 
we record the number of nonzero singular values as r. Notice that the three symmetric models only 
differ in sparsity. From the simulation, we can clearly see what role sparsity plays. 

The results are shown as boxplots from Figure 5.2 to 5.4. Each figure consists of five rows, 
which presents the prediction error, Frobenius error, nuclear error, estimated number of factors 
and the computational time, and the columns correspond to different values of A. 

The errors as functions in r of the three models show ”V” shape. This confirms the term 
r V (1 — r) appeared in the oracle bounds in Theorems 4.1. Furthermore, the model complexity 
rank(F) influences the error. Among the three models, the errors are smaller in the most sparse 
Model ES and larger in the less sparse Model LS. This confirms the factor rank(r) appeared in the 
oracle bounds given in Theorems 4.1. 

The A inducing the smallest error in the simulation of each model slightly differs. Notice that 
all components involved in selecting A in (4.13) are equivalent for the three symmetric models, so 
the optimal A should be the same for the three models. In addition, A changes the way how errors 
depend on r. In Model LS, the ”V” shape shown in the Frobenius and nuclear deviation becomes 
more flat. Hence, in such model we should choose a smaller A when the quantile at level r = 0.5 is 
to be estimated, and a bigger A when the quantiles at r close to 0 or 1 are to be estimated. 

The number of factors selected for the three models are generally accurate. We find that for 
T = 0.5 the algorithm almost always makes correct selection for all the choices of A and all the 
three symmetric models. For Model ES the algorithm selects the correct number of factors even 
for r = 0.2,0.8 when A is large. For other r, particularly the extremes ones close to 0 or 1, it is 
more difficult to recover the true number of factors. 

About the computational efficiency of our algorithm, the time required for the algorithm to 
converge increases with the complexity. This fact corresponds to the term ||r* — FqIIf in inequality 
(3.7). When we look at the most sparse Model ES Figure 5.4, the algorithm converges in less than 
80 seconds in the best case A = 10“^. For Model LS and MS, smaller choices of A usually imply 
longer time for the algorithm to converge, while larger choices of A allow the algorithm to converge 
in less than 250 seconds for Model LS and 100 seconds for Model MS. On the other hand, r has 
influence on the convergence time, which corresponds to the inequality (3.7) and the third point of 
Remark 3.1. For example, in the last row of Figure 5.2 and 5.3, the case r = 0.5 takes least time 
when A is small, but this situation reverses in the most sparse Model ES. 

5.2. Asymmetric models 

To further illustrate our method, beside adjusting the level of sparsity as done in Section 5.1, 
in this section we specify asymmetric models for the conditional distribution of Yij. Let F i and F 2 
be two p X m matrices of rank ri and r 2 with following two specifications: 

• Model AES (asymmetric extremely sparse): (ri,r 2 ) = ( 2 , 2 ); 

• Model AMS (asymmetric moderately sparse): (ri,r 2 ) = (2,10). 

For each model, two matrices Fi and F 2 are chosen: 

1. Generating vectors {ai,...,ari} and { 61 ,..., 6 ^ 2 } The components of each vector are 

i.i.d. uniform distributed random variables supported on [ 0 , 1 ]; 
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2. Each jth column in Fi is Oikjcik where akj are independent random variables in C/[0,1]; 

similarly, each jth column in r 2 is Yl^k=i Pkjbk where /3k,j are independent random variables 
in C/[0,1]. 

Now we discuss the data generation. Let Uij be i.i.d. uniform random variable supported on 
[0,1], i = and j = 1,...,500. Generating Xi from A^(0,S) with aij = and then 

setting Xi = Xi will have support [0,1]^ and be correlated according to Falk (1999). The 

response variables are generated by 

Yij = <1> ^(Uij)Xj [Tl^^,jl(Uij < 0.5) +r2,*jrT(t/ij > 0.5)] , (5-2) 

where $(•) is the cdf of A^(0,1). is i.i.d. by construction. Notice that when conditioning on Xi, 
the randomness comes only from Uij, which is independent of Xi. Hence, Yij is independent in j 
when conditioning on Xi. 

The exact conditional quantile function qj{T\x) of Yij on x is 

qj/rlx) = <h“^(r)a;^ri^*j, r < 0.5; 
qj{T\x) = <h“^(r)a;^r 2 ,*j, r > 0.5, 

for j = 1,..., 500. Note that at <h“^(0.5) = 0, and therefore the coefficient matrix at r = 0.5 is 0. 

Figure 5.1 gives an illustration of the marginal densities of Yij for j = 1, ...500. The left figure 
is associated with Model AMS in which the densities tend to be asymmetric, in the sense that they 
have thick right tails and thin left tails. The densities are also more disperse. The right figure is 
associated with Model AES, and the densities are more symmetric and less disperse. 
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Figure 5.1: The plot of all 500 marginal densities of in asymmetric models. The left figure is 
associated with Model AMS in which the densities tend to be asymmetric (thick right tails and 
thin left tails). The right figure is associated with Model AES in which the densities are more 
symmetric. 

The simulation run is 500. The measure of performance is the same as that of symmetric 
models. In this simulation, we select A = (0.005,0.01,0.05,0.1). The numerical performance of 
the asymmetric model is shown in Figure 5.5 and 5.6. For reference, the simulated A for r = 
5%, 10%, 20%, 50% are A = (0.002308,0.002310,0.002314,0.002308). The A for r = 95%, 90% and 
80% are the same as that of r = 5%, 10%, 20% by symmetry. 

Some patterns can be observed from the simulated estimation errors of the two models. Despite 
the fact that Fi ^ F 2 , the asymmetry in distribution is not significant and the error as a function 
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of T from Model AES is in symmetric ”V” shape. This again corresponds to the factor r V (1 — r) in 
Theorems 4.1. In terms of the choice of A, small A appears to give smaller errors for both models. 
However, the errors corresponding to r > 0.5 in Model AMS are notably higher than those in 
Model AES. This is owing to the fact that the matrix r 2 in Model AMS is less sparse than Model 
AES. This simulation result confirms the factor rank(r) in the oracle bounds in Section 4. 

The number of nonzero singular values is almost always correctly estimated in Model AES. As 
expected, the estimated number of nonzero singular values of Model AMS is higher than that in 
Model AES when r > 0.5. However, we find that the estimated number of nonzero singular values 
is 2 in Model AES and between 5-7 in Model AMS, seemingly the average of the rank of Ti and r 2 . 
However, the true number of nonzero singular values at r = 0.5 is exactly 0. This shows that the 
singular values are hard to be accurately estimated if the coefficient matrix is not continuous in 
r. 

The computational time generally follows the rule of (3.7). When A is small, we find that 
the variation of r = 0.5 tends to be large. Due to high rank(r 2 ) in Model AMS, it is more 
computationally demanding to recover T,- for r > 0.5, as implied by the term ||r* — To^tIIf in 
inequality (3.7). 


21 


5e-06 A, = 1 e-05 A, = 5e-05 A, = 1 e-04 



JOJJ0 uoipipejd JOJJ0 sniusqojj jojj0jB0pnN sjopBjjO'ON 0LU!rclLuoo 


22 


Figure 5.2: The symmetric Model LS. The horizontal axis is r. The true number of factors is 125. 
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Figure 5.3: The symmetric Model MS. The horizontal axis is r. The true number of factors is 10. 
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Figure 5.5: The asymmetric Model AES. The horizontal axis is r. The true number of factors is 2 for r < 0.5 and 10 for r > 0.5. 0 for 
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Figure 5.6: The asymmetric Model AMS. The horizontal axis is r. The true number of factors is 2 for r < 0.5 and 10 for r > 0.5. 0 for 

























6. Real data application: SAMCVaR model 


In this section, we apply the regularized multiple quantile regression on financial data. In 
Section 6.1, we propose a modification of CAViaR model. Section 6.2 deals with the data selection 
and choice of the tuning parameter A. Section 6.3 is devoted to the empirical findings. 


6.1. Model 


Since Engle and Manganelli (2004) proposed the conditional autoregressive value at risk (CAViaR) 
model, financial econometricians have applied it in many empirical studies and proposed many vari¬ 
ations for it. The model analyses a univariate autoregressive structure in quantiles, which does not 
account for the interdependence of asset returns. As the financial spillover effect has been widely 
understood as a risk source, the quantification of spillover effects has been an important issue for 
financial econometricians. 

White et al. (2008) introduce a multi-quantile modification of CAViaR (MQ-CAViaR), which 
allows a sequence of conditional quantile of asset returns to depend on each other. Combining with 
the robust estimation for skewness and kurtosis using quantiles of Kim and White (2004), they study 
the time varying patterns of higher moments of asset returns. In White et al. (2015), they consider 
the spillover effect in asset returns by the multivariate MQ-CAViaR (MVMQ-CAViaR) model, 
which combines the MQ-CAViaR models of a set of asset returns. Nonetheless, they estimated a 
simpler bivariate CAViaR for each asset return with a single universal market index, for which they 
took the World Financials price index provided by Datastream. 

In contrast to previous models, we consider a multivariate model which jointly incorporates 
multiple asset returns. Let Yj^t be the asset return for firm j, j = 1, at time t, t = 1, 

Let qtj{T\J^t-i) be the conditional quantile at level r for asset return j at time t on filtration 
J-t-i- From the spirit of multivariate CAViaR, we consider the Sparse Asymmetric Multivariate 
Conditional Value-at-Risk model (SAMCVaR): 

m m 

qt,j{T\J^t-i) = ^li,j,k{'^)\yt-i,k\ +^l 2 ,j,k{'^)Y^Zl^k^ (6.1) 

k=l k=l 


where Y = max{-V,0} and = (7i,i(T)"^, 72 j(t)"^)"^ and 'yijir) = ( 77 i(t), •••, 7/j,m(T)) 

for I = 1,2. The rank r of T satisfies r <^m. Following the discussion in Section 2, we impose the 
condition that Yl'k=i ^jk — 


Xt-i = (iV-i 




Yr 


.Yr 


\T 


G 


\)2m 


( 6 . 2 ) 


We may therefore rewrite (6.1) as 


qt,j{T\Ft-i) = qt,j{T\Xt-i) = Xj_;^T^j{T). 

If letting q^{T\Xt-l) = {qt,i{T\Xt-i), ...,qt,m{T\Xt-i))~^ be a vector of quantiles of all the firms in 
the sample, then q^{T\Xt-l) = where T = [r*i,..., and we have the multivariate 

quantile regression model (2.4) 

This model is a multivariate variation of CAViaR, and we replace the autoregressive qt-ijir) 
in CAViaR model by a dispersion measure for asset j in the information set at time t — 1. 

The inclusion of the lag negative return which also appears in the CAViaR model with 

” asymmetric slope”, is based on the intuition that ” one bad day makes the probability of the next 
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somewhat greater” (Engle and Manganelli; 2004). Two major features of model (6.1) are that the 
quantile of each firm is time-varying] moreover, (6.1) accounts for the spillover effect on financial 
firm j from hnancial firm I ^ j. 

We estimate T via the nuclear norm regularized multivariate quantile regression. We select 
T = 1% and 99%, in which r = 1% corresponds to the VaR of the asset returns, while r = 99% 
corresponds to the growth potential of the assets. 

The factorisation described in Section 2 is applied to gain an insight into the common structure. 
We factorise the covariates into factors ft, lir), ...,ft,r{T) where r <C m by using the left singular 
vectors of T. We investigate two aspects related to the factors. The first is how a firm Tj-ij 
contributes to the factor; the second is how sensitive the conditional quantile of a firm is relative 
to the factor. We may study the contribution of firm j to the variation of the market by the 
coefficients associated to the two transformations | Ytj \ , Yff in the factor fk: 


Contribution from component j to /^(t) 


dfkir) 

dm,Yr) 


) V2,k,j ) • 


(6.3) 


Note that the contribution from component j to fr{T) does not vary over time. On the other hand, 
the sensitivity to the variation of the market can be described by 


Sensitivity of j quantile to /fc(r) 


dqj{T\X) 

dfkir) 


'f^j,k- 


(6.4) 


With the singular value decomposition T = UDV^, the contribution of j hrm to the factor 
fk defined in (6.3) can be computed by the j,j + m element in the G times ak, where 
iTfc is the fcth singular value on the diagonal of D. The quantity in (6.4) can be found by the kth 
component in 


6.2. Data and tuning 

We obtain a set of stock prices consists of m = 230 major global financial firms. The dataset 
can be downloaded from Simone Manganelli’s website, which is used in White et al. (2015). Their 
data period is from Dec. 31, 1999 to Aug. 6, 2010. The regional and industrial characteristics can 
be found in Table 1 of White et al. (2015), which we include in Table 6.1 for completeness. 



Bank 

Financial Service 

Insurance 

Total 

EU 

47 

22 

27 

96 

North America 

25 

17 

28 

70 

Asia 

47 

14 

3 

64 

Total 

119 

53 

58 

m = 230 


Table 6.1: The summary of financial firms in the data. 


We use the data from August 31, 2007 to August 6, 2010. There are 766 closing prices for 
each stock in the sample. We compute the daily log-return. This results in sample size n = 765. 
The dimension of the input variables Xt is p = 2m = 460, as we consider two transformations 
for each asset return, as in formula (6.2). Figure 6.1 shows the time series plots of the log-returns 
of the 230 financial institutions over this data period, and a plot of volatility index (VIX) kept 
by Chicago Board Options Exchange. The plot of asset returns suggests there are two large high 
volatility clusters before and after the beginning of the year 2009, which corresponds to the subprime 
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mortgage crisis. Another phase of volatility increase is around mid 2010, which corresponds to the 
rising concern of the European debt crisis. The data show strong asymmetry, as the returns 
demonstrate high negative skewness. Though VIX is constructed by the returns of the S&:P500 
constituents, it appears to approximate the global financial risk too. 




>< 

> 


Figure 6.1: The upper figure shows the time series plots of the 230 global financial institutions with 
different grey level distributions and thicknesses. The lower figure shows the time series of VIX. 


To select the tuning parameter A, applying the procedure described in Section 3 gives A = 
0.02467565 for r = 1%. By symmetry we also apply A = 0.02467565 for r = 99%. 


6.3. Empirical findings from global financial data 

In this section we discuss the empirical findings from factorizing the multivariate quantile re¬ 
gression (6.1) at level r = 1% and 99%. After the factorisation by SVD, the time series plot of the 
first two factors for the two sets of quantile regression are reported in Figure 6.2. Both first factors 
/]^(0.01) and /^(0.99) are volatile and moving away from 0 at the end of 2008 and in the first 
quarter of 2009, and mid 2010, which corresponds to the phases of volatility increase as indicated 
in Figure 6.1. Moreover, as can be seen from the figures, the two time series /;^(0.01) and /;^(0.99) 
are negatively correlated. The absolute scale of the two second factors /2(0.01) and /2(0.99) are 
much smaller than the first factors. A sharp peak appears in the plot of /2(0.01) at the first quarter 
of 2009. The time series of /2(0.99) is volatile before and after the beginning of the year 2009. 

In what follows. Section 6.3.1 presents the estimated factors and the analysis of r-range at 
r = 1%. Section 6.3.2 presents the tail factor analysis at r = 1%. 


6.3.1. r-range analysis 

In this subsection we discuss the common structure of r-range of global financial returns with 
r = 1%. 

First we consider the contribution to and the loadings associated with the first factor. Figure 
6.3 shows that the contribution to the first factors lie in the second quadrant, which suggests that 
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1st factor 


2nd factor 




Figure 6.2: The time series plots for the hrst 2 factors. The black lines corresponds to 1% quantile 
factors and the blue lines corresponds to 99% quantile factors. 

all the covariates have negative impact to the first factor of 1% multivariate quantile regression, 
and positive impact to the hrst factor of 99% multivariate quantile regression. The dots and hrm 
names in black represent the lag absolute log-returns, and they tend to lie around the diagonal line 
or even above it. This suggests that the absolute lag log-returns tend to contribute equally to both 
/]^(0.01) and fi{0.99), which is consistent to the intuition that higher return in period t — 1 induces 
higher volatility in period t. On the other hand, the lag negative part j marked in red are more 
located below the diagonal line, which suggests that the j contributes more to /^(O.Ol) than 
to /;^(0.99). The well-known ’’leverage effect” postulated by Black (1976) suggests the tendency 
that the volatility of an asset is negatively correlated to the the asset return. Furthermore, it is 
suggested that such effect is asymmetric: the coincidence between the loss in period t — 1 and larger 
volatility in period t is more frequent than the coincidence between the gain in period t — 1 and lower 
volatility in period t, as documented by Engle and Ng (1993). However, as volatility or variance 
is a symmetric measure of dispersion of distribution, it is incapable of revealing information of 
the potentially asymmetric contribution to such dispersion. Figure 6.3 uncovers the fact that the 
increasing dispersion (volatility) of the distribution in asset return in response to the nonnegative 
loss j is largely due to the drop of lower quantile factor /^^(O.Ol) rather than the rise of upper 
quantile factor /^(0.99). In particular, such increase in volatility creates as much impact in loss 
but relatively less potential in gain. 

Figure 6.4 illustrate the loadings to the first factor of of 1% and 99% multivariate quantile 
regression. The loadings are all positive, and lying on the 45 degree line, which suggests that the 
firm highly associated with the first factor of 1% MQR would also be highly associated with the 
first factor of 99% MQR. This implies that the direction of change in the r-range over the returns 
is similar, but the magnitudes is different. Indeed, the firms lying on the far top right corner 
are the firms with high market risk sensitivity, including Huntington Bancshares Inc., American 
International Group, Allied Irish Banks and more, whose time series patterns best resemble that 
of the first factors /^(O.OI) and /^(0.99). The return time series of several risky firms are shown 
in Figure 6.9, in the sense that during financial crisis of 2008-2009, the range of their distribution 
is very disperse, which leads to large volatility. 

Second factors /2(0.0I) and f 2(0.99) describe the extreme market movement in the beginning of 
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Figure 6.3: The contribution to the first factor of 1% and 99% MQR from the 230+230 covariates. 
The firm name and the dots in black denote the lag absolute log return \Yt-ij\. Dots and firm 
name with in red denote the lag negative return Yr, •. 
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Figure 6.4: The factor loadings of 230 firms on the hrst factors /^^(O.Ol) and /^(O.OO). 


2009. In Figure 6.5, the dots in black, corresponding to the contribution from lag absolute returns, 
are more located above the line corresponding to zero contribution to factor f 2 ( 0 .99), while the 
red dots, corresponding to the contribution from the lag negative part, tend to gather below the 
line. This suggests again that the lag negative part has less impact on the factor associated with 
the upper quantile. Moreover, the dots lying at far right and separate from the other points are 
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associated mainly with the Bank of Ireland, Allied Irish Banks and the Royal Bank of Scotland 
Group, which lead to the peak of f 2(0-99). 
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Figure 6.5: The contribution to the second factor of 1% and 99% MQR from the 230+230 covariates. 
The firm name and the dots in black denote the lag absolute log return Dots and firm 

name with in red denote the lag negative return j- 


The factor loadings of firms on the second factor /2(0.01) can be applied to distinguish the 
firms being influenced most by the sharp peak of /2(0.01) at the beginning of 2009. In Figure 6.6, 
the loadings of asset returns on / 2 ( 0 . 01 ) and /2(0.99) are mainly distributed in the first and third 
quadrants. The firms whose r-range influenced negatively by the peak of /2(0.99) are in the second 
and third quadrants. In particular, the r-range of the firms located in the second quadrant shift 
downward at the beginning of 2009. On the contrary, the r-range of the firms located in the third 
quadrant expands at the beginning of 2009. The r-range expanding the most are PNC Financial 
Services Group, Inc., State Street, Lloyds Banking Group PLC. 


6.3.2. Tail factor analysis 

In this section, we discuss the empirical findings by looking at the contribution to and the 
loadings associated with the first two factors of 1% multivariate quantile regression. 

Figure 6.7 illustrates the contribution from the covariates to the first and second factor of 1% 
MQR. Lag negative returns concentrates on the lower right of the figure and is below the horizontal 
line y = 0, and lag absolute returns spread around the horizontal line y = 0. The absolute and 
negative lag return of Allied Irish Bank, Bank of Royal Scotland Group and Bank of Ireland are 
more isolated and located at the top left corner, and are highly related to the first and second 
factor of 1% MQR. This suggests that they have high association with the downside risk of the 
global financial market. 

Figure 6.8 shows the factor loadings of each firm on the first and second factors of 1% MQR. 
The points are gathering on the top left with positive loadings on factor 2, and then spreading to 
the lower right like a fan. The pattern suggests that the firms positively associated with the first 
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Figure 6.6; The factor loadings of 230 firms on the second factors /2(0.01) and f 2(0-99). 
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Figure 6.7: The contribution to the first and second factor of 1% MQR from the 230+230 covariates. 
The firm name and the dots in black denote the lag absolute log return Dots and firm 

name with in red denote the lag negative return 


factor of 1% MQR tend to be negatively associated with the second factor of 1% MQR. Together 
with the information that the first factor of 1% MQR is generally negative and the second factor of 
1% MQR has a positive peak from Figure 6.2, Figure 6.8 implies that the firms lying on the lower 
right direction bear high market risk in our sample. 

The similarity in the lower tail of the distribution can be inferred by the distance in Figure 6.8. 
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The shorter the distance between the two points on Figure 6.8, the larger their similarity is in the 
1% quantile. For example, the distance between State Street and PNC Financial Services Group, 
Inc. is close, and their 1% quantile time series have similar behavior, which can be seen from their 
time series plots in Figure 6.9. 
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Figure 6.8: The factor loadings of 230 firms on the second factors /^(O.Ol) and /2(0.01) of 1% 
MQR. 
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Figure 6.9: Plots of individual asset time series and their 1% and 99% fitted quantiles. 

35 































7. Factor curves model 

In this section, we extend the parametric linear multivariate quantile regression model to a 
nonparametric model, in which the unknown conditional curves are approximated by sieve spaces. 
Section 7.1 introduces the factorisable ’’quantile curve” and the factor curves. Section 7.2 deals 
with the estimation of the model. Section 7.3 applies the nonparametric factorisable quantile curves 
on the temperature data of 159 weather stations from China and classifies the primary patterns in 
Chinese temperature. 

7.1. Model 

For functional data, the concept of ’’quantile” is not as well understood as that for a usual uni¬ 
variate random variable. The functional data can be understood as the realizations of a functional 
variable (see, e.g. Ferraty and Vieu (2006)), which is a map Y : Q —?■ C, where fl is the sample space 
and C is the set of all continuous function on T. Without loss of generality, T can be a bounded 
interval. As an example, the standard Brownian motion W{uj,t) is also a functional variable. 

Definition 7.1 (Quantile curves). For 0 < r < 1, the r quantile eurve qrit) of functional variable 
Y is also a continuous function in t satisfying 

P({a; : Y{u,t) < qr{t), Vf G T}) = r. 

For fixed f G T, it holds that P({cu : Y{uj,t) < qrit), Vt G T}) = r. Taking standard Brownian 
motion W{t) as an example, the r quantile of W{t) is qrit) = \/t^~^{T), where <!(•) is the cdf of 
standard normal distribution. When r is close to 0 or 1, we call qrit) a tail event eurve. 

Consider m functional variables Yi{t), ...,Ym{t), denote their quantile curves qr,j{t). Suppose 
that qT,j{t) lies in T which is the class of functions / defined on [ 0 , 1 ] whose sth derivative 
exists and satisfies a Lipschitz condition of order 7 ; 

- /^^'^(t)| < C\t' - for t',t G [a,b], 

for s = s'+9 > 0.5. We assume that s' > 1 and 9 > 0 throughout the following discussion. Based on 
the construction of Schumaker (1981) and Stone (1985), each function q^-j G IF can be approximated 
by an element qn^T,j{t) £ Sn so that \\qn,T,j — QtjWcx) = C){Pn^) (see the discussion in p. 150 of Newey 
(1997)), where Sn is an expanding functional class with basis functions {bi,l < I < pn}- Denote 
b{t) = ( 6 i(t), ..., 6 p„(t)), so that 

qn,r,jit) =Tjjb{t), (7.1) 


where is jth column of matrix P. 

The timings of measurement are ti, ...,tn for all j. Denote B = (Bn) G with = bi(ti) 

and Y = (Yij) G with Yij = Yj{ti). The matrix P can be viewed as the coefficient matrix in 

the multivariate quantile regression model 


Qn,r(i) = BP. 

Qnri't) = iQn,T,i(t), ■■■,Qn,T,mi't)), and P Can be estimated as in Section 3, but now the covariates 
are the values of the basis functions evaluated at ti, ...,tn- 

Furthermore, model (7.1) is also factorisable. If the SVD of P is P = UDV"'^ and the number 
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of singular values of F is r. Similarly to (2.5) 


r 

Qn,T,j{t) = Vj^kfkit), (7.2) 

k=l 

where /^(t) = (TfcUj)i.fo(t) may be called factor curves with factor loadings Vj^k- 

7.2. Estimation 

Similar to Section 3, we minimize the following loss function: 

n m 

(nm)-i ^ ^ pr{Yij - BlT,j) + A||r||* Q,,f.(r) + A||r||„ (7.3) 

i=l j=l 

with Pt{u) = u{t — 1 {m < 0 }) with given 0 < r < 1 . 

The empirical loss QT,b{Y) is non-smooth. Apply the approach in Section 3, the smoothed 
version of Qr,b{^) with a Lipschitz gradient is QT,b,K{B)- Algorithm 2 can be directly applied by 
using Qr,b,KiB)- The convergence analysis is similar to Theorem 3.2. The details are omitted for 
brevity. 


Algorithm 2: Smoothing fast iterative shrinkage-thresholding algorithm (SFISTA) 

| 2 . 


1 Input: Y, B, A, K = 75 -^, M = — 

2 Initialization: Fq = 0, Hi = 0, step size (5i = 1; 

3 for t = 1, 2, ..., T do 


rf = s. 


5t+i — 


XjM 

2 




Hi+i — Fi -|- 


it-i 

<5t+i 


(Tt - Fi_i 


7 end 

8 Output F = Fr 


For the choice of the number of spline basis pn, from bias and variance decomposition of spline 
estimator (Huang; 2003), under the fact that the functions to be estimated in our case are univari¬ 
ate, the convergence rate of the estimator is Op{p~^ + The order of pn minimizes the 

convergence rate is usual assumption is s = 2 . 

7.3. Application: Chinese Temperature Data 

In this section we apply the nonparametric multivariate regression model to real data. We utilize 
the Chinese temperature data in the year 2008 from 159 weather stations around China, which can 
be downloaded from the website of Research Data Center of CRC 649 of Humboldt-Universitat zu 
Berlin. The dataset consists of one year time series of daily averaged temperature. 

Before applying our method, we first fit a mean curve with smoothing spline which describes 
the mean temperature of China in the year 2008. In Figure 7.1, the bottom subfigure is the fitted 
trend curve, which shows seasonal pattern. The detrended temperature time series of 159 weather 
stations in the top figure of Figure 7.1 also demonstrate a seasonality pattern. The deviation to 
the mean temperature is larger in winter than in summer among these weather stations. 
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Figure 7.1: The temperature time series in excess to national mean of the 159 weather stations 
around China with different grey levels and thicknesses. The figure below is the temperature trend 
curve estimated by smoothing spline. 


We will apply the nonparametric multivariate quantile regression to further investigate the 
detrended temperature curves. The S-spline basis functions are used, and the number of basis 
function is p = = 11. The timing of measurement is daily ti, ...,1^^^. The quantile levels are 

r = 1% and 99%. We choose the tuning parameter A by applying the procedure of simulating (3.8) 
and compute A by (3.9), the estimated value is A = 0.000156. 

7.3.1. r-range analysis 

In this section, we present the factors and the r-range analysis using the loadings to the factor 
curves. 

Figure 7.2 presents the first four factors. The hrst factor of 1% and 99% quantile regression 
enclose a region which is wide in both ends and narrow in the middle. This matches our observation 
for Figure 7.1 that the deviation in temperature among weather stations tends to be higher in winter 
but lower in summer. Moreover, the two first factors captures two types of seasonality. The reverse 
V or U shape of the first factor of 99% multivariate quantile regression represents a ’’seasonality 
at high temperature”, while the V or U shape of the first factor of 1% represents a ’’seasonality at 
low temperature”. 

The factor loadings of the first factor for 1% and 99% quantile regression demonstrate a nearly 
”L” shape, as shown in Figure 7.3. This suggests that the weather stations positively associated 
with the first factor of 1% multivariate quantile regression have almost no association with the first 
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Figure 7.2; The time series plots for the hrst 4 factors. The black lines corresponds to 1% quantile 
factors and the blue lines corresponds to 99% quantile factors. 

factor of 99% multivariate quantile regression. Such dichotomy pattern allows for classifying the 
weather stations into groups. 

In Figure 7.3, the temperature curve of Tulihe has the highest factor loading in the first factor 
of 1% multivariate quantile regression, while the temperature curve of Dongfang has the highest 
factor loading in the first factor of 99% multivariate quantile regression. Thus, Tulihe is classihed 
as showing strong ’’seasonality at low temperature” and Dongfang shows strong ’’seasonality at 
high temperature”. Notice that the factor loading to the first factor of 99% multivariate quantile 
regression is slightly negative for Tulihe, and the factor loading to the first factor of 1% multivariate 
quantile regression is close to 0 for Dongfang. Another weather station marked in the figure 
is located in Yushu, which has small positive loadings to the first factor of both 1% and 99% 
multivariate quantile regression, and is hard to be classified to any of the two seasonality patterns. 

7.3.2. Selected ^veather station analysis 

This section discusses the three selected weather stations from Figure 7.3. 

Figure 7.4 shows the temperature plot, 1% and 99% quantile curves, and the location of the 
three weather stations marked in Figure 7.3. Tulihe is located in far northeastern Inner Mongolia, 
China, which is well-known for its bitter cold in winter and large temperature difference between 
summer and winter. While the estimated 99% factors are mainly influenced by the temperature 
curves from warmer areas, the reverse V-shaped yearly temperature curve of Tulihe cannot be 
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Figure 7.3: The plot of weather stations based on their factor loadings to 1% and 99% multivariate 
quantile regression. Each point denotes a weather station in China. 


captured by the 99% factors, and the estimated curve is flat. Dongfang, however, is located in 
tropics, and in winter at warmest the temperature is 25 degrees Celsius higher than the national 
average. The estimated 1% factors dominated by cold regions cannot fit the V-shaped yearly 
temperature curve of Dongfang, so its 1% quantile curve is flat. Yushu is located in central west 
China and belongs to highland climate. The average altitude in the region of Yushu is over 4000 
meters. It has high temperature variation within a day, and is generally slightly cooler in summer 
and warmer in winter than the national average. The seasonality for Yushu is not significant, and 
both the 1% and 99% factors do not fit Yushu’s yearly temperature curve well. 
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Figure 7.4: Plots of temperature observations, 1%, and 99% temperature quantile curves at the 

three selected weather stations in the year 2008. The location of the weather stations are marked 

in the upper left map of China. 
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Appendices 

A. Proofs for algorithmic convergence analysis 

A.l. Proof for Theorem 3.1 

Proof of Lemma 3.1 . One can show by elementary matrix algebra that 

n m 

e{T,®) = Y.Y1 -^7 r,,) 

i=l j=l 

n m n m 

2 = 1 J = 1 2=1 j = l 

= (Y,0) + (-xr,0). 


□ 


Proof of Theorem 3.1. To verify the conditions in Theorem 1 of Nesterov (2005), let (T 2 = 1, d{® ) = 
||0||p/2. Lemma 3.1 implies that 0(0) = (Y,0), which is convex and continuous. Applying 
Theorem 1 of Nesterov (2005) yields the desired result. □ 

A.2. Proof of Theorem 3.2 

Let L(r) = 4,,(r) + A||r||,. 

\L{rt) - L(r)| = |L(rt) - Z(rt)| + \L{Tt) - L(r)| + - L{r*)\. (a.i) 

We have for any F, 

L(r) < L(r) < L(r)+ K max < L(r)+ K(r V{1-r})2 —, (A.2) 

©e[T-i,T]"x™ 2 2 

where the first inequality directly follows from the definition of f^iT^W) in (3.3) and the second 
from the fact that max 0 g[.,-_i ,-]nxm || 1 T||| = max 0 g[.,-_i ,-]nxm Yli<n,j<m®ij ^ (''' V {1 - r})^nm. 
Hence, for any matrix F, by the choice of k in Algorithm 1, 


L(F)-L(F) 


nm{T V {1 -r})2 e(rv{l -r})2 

< K - < -^-. 


(A.3) 
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Since Theorem 3.1 implies that the gradient of (5T-,K(r) is Lipschitz continnous with Lipschitz 
constant M, by Theorem 4.1 of Ji and Ye (2009) or Theorem 4.4 of Beck and Teboulle (2009) 
(applied in general real Hilbert space, see their Remark 2.1), we have 


L{rt)-Lir*) 


< 


2M\\To - T* 


(A.4) 


where M is given in Theorem 3.1. Combining (A.3) and (A.4), pick k = ej{2mn), insert M = 
j^||X|p by Theorem 3.1, (A.l) can be estimated by 


|L(ri)-L(r)|< 


e(rV{l-T})2 , 4||ro-r 


+ 


|2 

If 


|X|| 


(t+ly 


emn 


(A.5) 


Setting the right hand side of (A.5) to be e and solve it for T yields the bound (3.7). 


B. Proof of oracle inequalities 

B.l. Proof for Lemma 4.1 

The key is the decomposability of the nuclear norm. Again let A = F — F, 

0<Q.(F)-4(f) 

< ||V(5r(F)|| II A||* + A(||F||* — ||F||*) (subgradient condition) 

< ||VQ.(F)|| (llPr(A)ll, + ||Pft(A)||,) + A(||Fr(r)|U - ||Pf^(f )||* - Pr(f)) 

(Decomposability of || • ||*) 

< l|V(3.(r)||(||Pr(A)||. + ||Pit(A)||.) + A(||Pr(A)||. - ||Pit(A)||.), 

Rearrange to get, 

(A-||VQ.(F)||)||RiL(A)||, <(A+||VQ.(F)||)||Rr(A)||,. 

Choose A > 2||VQ^(F)||, 

tA||Pit(A)||. < ^A||Pr(A)||.. 

Hence, HRp (A)||* < 3||'Pr(A)||*. Note that this condition appears also in Negahban and Wain- 
wright (2011) Eqn. (12) in pp. 1077. 


B.2. Proof for Lemma 4.2 

1. Let QT-j(F*j) = E[pr{Yij — X^^F^j)]. By Assumption 4.3, 


T A^j) (5r,j(P=(=j) — L 

= E 




F,, + z) - FY^\xSXjT,j)d 


Ljo 


r-Xj A. 


Ljo 




^fvjlXiiXjT^j) + y/y^|^^(x7F*j + z^)dz 
2 


-g/'E[|x7A,,fi 
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for G [0,-z]. Notice the condition ||A||^ 2 (n) < implies 

/ 


E[(Jt7A.,.)2] .. ,3, 


>-/'E[|A-,'A.,| 


Therefore, 


Q.(r + A)-Q,(r) >/m-i 

i=i 


m 


E(X,'A.,)^ 1 , 

4 “ |ill^lli 2 (n)- 


2. By the decomposability of nuclear norm, A G /C(r,3), Lemma 4.1 and Assumption 4.4, we 
can estimate 

||A||. = ||Pr(A)||. + ||P,t(A)||. < 4||Pr(A)||. < 4V&||Pr(A)||F < ^||An,,n). 

Pr,3 


B.3. Proof for Lemma 4.3 

We restrict on the event P that assumption 4.2 holds. Observe that Pr^Yij — XiiV^j + A^j)} — 
boils down to 

Pr{Y, - xj (r,, + A,,)} - Pr{Y, - xj r,,} 

= {r - l{£ij < Xj A^j)}{£ij - Xj A*j) - [t - l{£ij < 0)}eij 

=£ijl{XjA^j < £ij < 0) - £ijl{0 < £ij < XjA^j) - XjA^j[T - l{£ij < XjA^j)}, (B.l) 

where r — l{sij < Xj A^j) and £ij = Yij — XjV^j are independent across i and j by Assumption 
4.1. By triangle inequality and (B.l), A{t) < B{t) + C{t) + V{t) where 

B{t) = sup 

l|A||i2(n)<t,Ae-»C(r,3) 






C{t) 


sup 

l|A||i2(n)<i,Ae-»C(r,3) 



■ ^ //6 

“ ^ £iil(0 < £ij < X^ A^j) 
^ i=i 


m = 


sup 

AeK^.s) 


Gr, 


m 


i=i 


T 


A*j{r - l{£ij < Xj A*j)} 


First, we consider B{t). Condition on X, the variable £ijl{XjA^:j < £ij < 0) lies in (Xj^A*j,0]. 
Condition on Xj, applying Hoeffding’s inequality C.2 gives 


P 



^ lit 

i=i 


< Bij < 0) 



< 2 exp 


< 2 exp 


2m?s^ 

2m?s^ \ 
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where the second inequality comes from Assumption 4.2, Lemma 4.2 (ii) and Holder’s inequality; 
more explicitly, 

n m 

<n-Hr(AXTxA) < || A||2||n-ixTx|| < «2c2|| A||i^(n)||Sx||. (B.2) 

i=i j=i 


Hence, B{t) < iog(p+m 7 probability greater than 1 — 2{p + m) 

Performing similar procedure to C{t) as for bounding B{t), we also have 

C{t) < Q’^\/'^ 2 ||Xx||log(p + m ) probability greater than 1 - 2 (p + m)"^. 
m 

Next we consider 'D{t). Condition on X,, r — l{sij < XjA.^j) is independent across i and j 
and is bounded by r V (1 — r). Using Hoeffding’s inequality C.l, 


G, 


1 

m 


C's'^m? 


i=i 

< exp — 

< exp ( 1 — 


> s 


{r V (1 - r)}n-i ELi Er=i(^7 


C's^m 


{r V (1 -r)}a2||A||2^(^)C2CJmax(Sx)/ 
where the second inequality follows from the same deduction in (B.2). Therefore, 


V{t)< 


aty/2{T V (1 - t)}c 2 ||Xx|| log(p + m) 


with probability greater than 1 — 3{p + m) 


-2 


V^m 

Summing up the results for B{t), C{t) and 'D{t), with the restriction on the event H, we obtain 


Ait) < 


2r V (1 - r) at^yc 2 \\'Sx\\ log(p + m) 
C ^ m 


with probability greater than 1 — 9(p + m) ^ — 7 ^ as e < 3. 


B.4. Proof for Lemma 4.4 


Applying the same T-net argument on the unit m dimensional Euclidean sphere S'”* ^ = {u G 
M”* : ||u ||2 = 1} as in the proof of Lemma 3 in Negahban and Wainwright (2011), we obtain 


||X^W|| > 4s^ = P 


sup -|v^X^Wu|>4s 

vgSP-l ^ 

ugSm-l 


< gP+m 


sup 

vgSP-l,ugS'"-l 
||u|| = ||v|7l 


|(Xv,Wu) 


n 



(B.3) 


To bound n ^(Xv,Wu) = ^ ^ Er=i (v, (u, Wi) , first we show the sub-Gaussianity of (u, Wi). 

Since \Wij\ < r V (1 — r). It follows by Hoeffding’s inequality C.l that 


P((u, Wi) >s)< exp 



C's^ 

{r V (i -r)}||u||2 


exp 


C's^ \ 
T V (1 — r) y 
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It can also be concluded that (see Definition 5.7 and discussion of Vershynin (2012)) ||(u, Wi)\\^^ = 
V (1 - r). 

We apply Hoeffding’s inequality C.l again to bound n~^ Wj). Conditioning on 

Xi, we have 


n 


-i^(v,x,)(u,w-,: 


i=l 


> s ) < exp — 
< exp ( 1 — 


C'ns^ 


{TV(l-r)}n-iX:r=i(v,^*)2 

C'ns^ 

{r V (1 - t)}c2\\^x 


where the second inequality follows from the fact that ||v||2 = landn ^ — 

C2 1151x11 on the event that Assumption 4.2 holds. 

To summarize, on the event that Assumption 4.2 holds, 


n 


P -||X^W|| > 4s < 8P+”"exp 1 - 


C'ns^ 


< exp 1 — 


{rV (1 -r)}c2||Sx||) 
C'ns^ 


{r V (1 - t)}c 2 ||Sx| 


+ {p + m) log 8 


Therefore, 


1 


-||X'^W|| < 4 - W21og8 


n 


,{r V (1 - r)}c2||Xx|| p + m 


C 


n 


with probability greater than 1 — 3e (p+™)'°g8 — as e < 3. 


C. Supplementary Lemmas 

Definition C.l. Let X = with inner product (A,B) = tr(A^B) and II • II be the induced 

norm. / : A —)■ M a lower semicontinuous convex function. The proximity operator of f, Sf : X ^ 
X: 

Sf(Y) arg min|/(X) + J||X - Yf I, VY e A. 

THEOREM C.l (Theorem 2.1 of Cai et al. (2010)). Suppose the singular decomposition of 
Y = UDV^ G where D is a p x m rectangular diagonal matrix and U and V are unitary 

matrices. The proximity operator S\{-) associated with A|| • ||* is 

Sa(Y) U(D - Alp^)+VT, (C.l) 

where 1pm is the p x m rectangular identity matrix with diagonal elements equal to 1. 

LEMMA C.l (Hoeffding’s Inequality, Proposition 5.10 of Vershynin (2012)). Let Xi,...,Xn be 
independent centered sub-gaussian random variables, and let K = max* ||Aj||^2- Then for every 
a = (oi,..., an)~^ G M"" and every f > 0, we have 


^ ^ Oj Aj 


2=1 


> t < e • exp — 


C't 


1+2 




where C' > 0 is a universal constant. 
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LEMMA C.2 (Hoeffding’s Inequality: classical form). Let be independent random 

variables such that Xi G [ai,bi] almost surely, then 


2=1 


> t < 2 exp — 


2t^ 




LEMMA C.3 (Wainwright (2009)). Let X G be a random matrix with i.i.d. rows sampled 
from a p-variate N(0, Sx) distribution. Then for n > 2m, we have 


<7r. 


> ^rnini'^ x) 
n - 9 


n 


, <7max( -X^X ] < 9cJmax(Xx) 


> 1 — 4exp(—n/2). 


One may see the discussion after the proof of Lemma 9 in Wainwright (2009) for details. 
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